In today’s class we’re going to look at the generalised linear model, which is what we’ll be using for the rest of the module. The GLM allows us to extend the principle of linear regression to different kinds of outcomes using a link function. This is simply an equation that links linear predictors to non-linear outcomes, for instance “yes/no”, counts, etc.
The goal today is more theoretical than usual for our tutorials: we
will introduce the glm() function in R, and try to see how
it works ‘under the hood’ using what is known as maximum likelihood
estimation. MLE is just a way of estimating the parameters of a
distribution from the available data; in the particular case where the
errors are normally distributed, the ordinary least squares method we
used last term gives the same estimates as MLE.
By the end of today’s class you should:
- be able to use the glm() function in R.

glm() function

The glm() function is very similar to the
lm() function we’re already familiar with. In fact, you can
run a linear regression with the glm() function, because
linear regression is part of the same family as the generalised linear
model.
stargazer::stargazer(lm(Sepal.Length ~ Sepal.Width + Species, data = iris), type = "html")
| Dependent variable: | Sepal.Length |
|---|---|
| Sepal.Width | 0.804*** (0.106) |
| Speciesversicolor | 1.459*** (0.112) |
| Speciesvirginica | 1.947*** (0.100) |
| Constant | 2.251*** (0.370) |
| Observations | 150 |
| R² | 0.726 |
| Adjusted R² | 0.720 |
| Residual Std. Error | 0.438 (df = 146) |
| F Statistic | 128.888*** (df = 3; 146) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
stargazer::stargazer(glm(Sepal.Length ~ Sepal.Width + Species, data = iris,
family = "gaussian"), type = "html")
| Dependent variable: | Sepal.Length |
|---|---|
| Sepal.Width | 0.804*** (0.106) |
| Speciesversicolor | 1.459*** (0.112) |
| Speciesvirginica | 1.947*** (0.100) |
| Constant | 2.251*** (0.370) |
| Observations | 150 |
| Log Likelihood | -87.968 |
| Akaike Inf. Crit. | 183.937 |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
See how the glm() function, when we set the family = argument to
“gaussian”, gives us almost the same output as the lm() function. This
is because linear regression is based on the normal, or Gaussian,
distribution. What do you notice is different though?
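One thing that is not different: the coefficient estimates themselves are identical, and only the reported fit statistics change. A quick check (a sketch, not part of the original handout):

all.equal(coef(lm(Sepal.Length ~ Sepal.Width + Species, data = iris)),
          coef(glm(Sepal.Length ~ Sepal.Width + Species, data = iris,
                   family = "gaussian")))
## [1] TRUE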
stargazer::stargazer(glm(am ~ cyl + disp, data = mtcars,
family = "binomial"), type = "html")
| Dependent variable: | am |
|---|---|
| cyl | 0.743 (0.719) |
| disp | -0.028* (0.014) |
| Constant | 0.799 (1.992) |
| Observations | 32 |
| Log Likelihood | -14.293 |
| Akaike Inf. Crit. | 34.586 |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
If we want to use glm() to perform logistic regression,
we simply specify the distribution for which logistic regression
estimates the parameters, i.e., the binomial distribution.
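You can confirm at the console that the binomial family uses the logit link by default (an aside, not part of the original handout):

binomial()$link
## [1] "logit"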
Let’s quickly review what we use logistic regression for, and how it works. If we have an outcome that can only be true or false, yes or no, then it follows that our estimate for this outcome needs to be bounded between zero (for false/no) and one (for true/yes). It would make no sense to have an estimate less than zero or more than one; moreover, if we did get such an estimate, it would suggest that our model was somehow wrong, and that all our other predictions might be off.
Basically, we need a distribution that falls between zero and one, and we need to be able to link the coefficients of our predictors to this distribution. Here’s the distribution:
x <- 1:1000
plot(pbinom(x, size = 1000, prob = 0.2), type = "s", col = "blue",
     main = "Binomial distribution function",
     xlab = "Number of successes", ylab = "F(x)")
lines(pbinom(x, size = 1000, prob = 0.4), type = "s", col = "red")
lines(pbinom(x, size = 1000, prob = 0.6), type = "s", col = "green")
legend("bottomright",
       legend = c("0.2", "0.4", "0.6"),
       col = c("blue", "red", "green"),
       lty = 1,  # needed so the legend draws coloured line keys
       title = "probability")
And here’s the link function:
\[\log\frac{p}{1-p} = \beta X\]
Remember: all the link function does is convert from a probability (0 to 1 scale), which follows a curve, to log odds (\(-\infty\) to \(\infty\) scale), which is a straight line. We typically use log odds for \(\beta X\) (literally, the log of the odds) because the log odds scale is symmetric about zero and unbounded. Taking the log of the odds also converts the terms in our model from a multiplicative to an additive relationship (i.e., we add the terms rather than multiply them), which matches what we are familiar with from linear regression.
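To see the two scales side by side, base R has qlogis() (probability to log odds) and plogis() (log odds back to probability). A quick sketch, not part of the original handout:

p <- 0.8
qlogis(p)           # log(p/(1-p)) = log(0.8/0.2)
## [1] 1.386294
plogis(qlogis(p))   # and back again
## [1] 0.8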
It is easy to get confused in logistic regression. There is a difference between odds and odds ratios, between log(odds) and log(odds ratios), between the logit link function used to convert between probabilities and log odds, and the log likelihood used in maximum likelihood estimation to fit the predicted line. It takes time to familiarise yourself with these. In general, we will focus on how to interpret our model coefficients and how to make a prediction based on them.
Let’s compare a logistic regression model with a linear model we’re
familiar with, using the iris dataset we’re also familiar
with.
dat <- iris
dat$set <- ifelse(iris$Species == "setosa", 1, 0)
mod1 <- lm(set ~ Petal.Length + Petal.Width, data = dat)
stargazer::stargazer(mod1, type = "html")
| Dependent variable: | set |
|---|---|
| Petal.Length | -0.251*** (0.032) |
| Petal.Width | 0.010 (0.073) |
| Constant | 1.266*** (0.044) |
| Observations | 150 |
| R² | 0.852 |
| Adjusted R² | 0.849 |
| Residual Std. Error | 0.183 (df = 147) |
| F Statistic | 421.497*** (df = 2; 147) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
Here, we ran a linear model on a binary outcome (is the iris of species setosa or not?). Let’s make a prediction from our model based upon some imaginary data: an iris with petal length of 5.4cm and width of 2.4cm.
newdat <- data.frame(Petal.Length = 5.4,
Petal.Width = 2.4)
predict(mod1, newdat)
## 1
## -0.0675413
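We can reproduce this by hand from the rounded coefficients in the table above (the small discrepancy with predict() is just rounding):

1.266 + (-0.251 * 5.4) + (0.010 * 2.4)
## [1] -0.0654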
As we can see, our model gives us a prediction below zero - does this make sense?
Let’s try again with logistic regression.
mod2 <- glm(set ~ Petal.Length + Petal.Width, data = dat,
family = "binomial")
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
stargazer::stargazer(mod2, type = "html")
| Dependent variable: | set |
|---|---|
| Petal.Length | -17.596 (43,449.070) |
| Petal.Width | -33.887 (115,850.800) |
| Constant | 69.447 (43,042.940) |
| Observations | 150 |
| Log Likelihood | -0.000 |
| Akaike Inf. Crit. | 6.000 |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
So, we get a different set of coefficients for our predictor variables, both of which are now negative… what does this mean? (The warnings above, and the enormous standard errors, are a clue: setosa is perfectly separable from the other species by petal size, so the fitted probabilities are numerically 0 or 1 and the coefficients are not well determined. The model still serves to illustrate the mechanics.) How do we make a prediction? Let’s try with our same imaginary data.
predict(mod2, newdat)
## 1
## -106.8992
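This is just the linear predictor computed from the coefficients; by hand, from the rounded values in the table (the small discrepancy is rounding):

69.447 + (-17.596 * 5.4) + (-33.887 * 2.4)
## [1] -106.9002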
Well, that doesn’t seem too useful - we’re even further below zero
than the linear model. Except, if we check the help file for
predict.glm()…
Let’s try again with the argument type = "response":
predict(mod2, newdat, type = "response")
## 1
## 2.220446e-16
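As an aside, that number is no accident: 2.220446e-16 is machine epsilon, the smallest relative step a double-precision number can represent, so the model is effectively predicting a probability of zero.

.Machine$double.eps
## [1] 2.220446e-16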
What’s the difference? By default we get the log odds prediction; if we want to get the probability from this, we need to reverse the link function:

\[p = \frac{e^{\log(\text{odds})}}{1 + e^{\log(\text{odds})}}\]
Let’s try this out on a slightly more normal range of data, petal length = 2.3 and petal width = 0.8.
newdat2 <- data.frame(Petal.Length = 2.3,
Petal.Width = 0.8)
predict(mod2, newdat2, type = "response")
## 1
## 0.8661015
log_odds <- predict(mod2, newdat2)
exp(log_odds)/(1+exp(log_odds))
## 1
## 0.8661015
1/(1+exp(-log_odds))
## 1
## 0.8661015
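Incidentally, that last expression is exactly the logistic distribution function, which base R provides as plogis(), giving a third equivalent form:

plogis(log_odds)
## 1
## 0.8661015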
As you can see, we get the same prediction. Interpreting coefficients and predicted values in logit regression is less straightforward than linear regression. We will look into this in greater detail in the weeks to come, but this link has some helpful examples that explain the difference between log odds, odds, and probability, and how we can calculate each with R.
Before we move on, just to drive home the difference between linear and logistic regression, look at the plot below.
library(ggplot2)  # load ggplot2 if it isn't already attached

ggplot(data = dat, aes(Petal.Length, set)) +
  geom_point() +
  geom_smooth(method = "lm", color = "blue") +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), color = "red")
Here, we’re plotting the probability of an iris being from the setosa species according to the length of its petals. As we can see, the larger the petal, the lower the probability. However, whereas the linear regression line is not bounded by zero on the y axis, and so the line moves negative above roughly 5cm on the x axis, the logit regression is asymptotic to zero and one on the y axis: no matter how large the petal, the prediction does not go below zero.
optim()

How exactly does R fit this line? We know how linear regression works: we sum the squared residuals and find the line that minimises this sum. It turned out there was actually a formula that found this solution without all the faff of experimenting:
\[b = \frac{\sum(x-\overline{x})(y-\overline{y})}{\sum(x-\overline{x})^2}\]

\[a = \overline{y} - b\overline{x}\]
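As a quick check (a sketch, not in the original handout), we can compute these by hand in R and compare with lm() on a simple one-predictor model:

x_ols <- iris$Sepal.Width
y_ols <- iris$Sepal.Length
b <- sum((x_ols - mean(x_ols)) * (y_ols - mean(y_ols))) /
     sum((x_ols - mean(x_ols))^2)
a <- mean(y_ols) - b * mean(x_ols)
c(intercept = a, slope = b)
coef(lm(Sepal.Length ~ Sepal.Width, data = iris))  # should match exactly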
Wouldn’t it be great if there was a similar equation for logistic regression? Sadly, there is not. Let’s watch a quick video that explains why.
So, if there’s no equation, how do we work out the best fitting line?
Let’s see how it works in R. We’ll begin by setting a seed for our random data generating functions:
set.seed(2022)
options(scipen = 500)
Next, we’ll create a vector of predictor variables using
rnorm():
x <- rnorm(10000) # 10,000 random draws from the normal distribution
                  # with mean of 0 and sd of 1.
Now, we’ll set the coefficients we want to estimate:
intercept <- 3 # the coefficient of our intercept
slope <- 0.2 # the coefficient of our slope
And finally we want an outcome variable, y, which is binomial (1 or 0). We’re going to make the probability of a 1 or 0 dependent on a combination of the coefficients we just made up.
Because we’re using the binomial distribution, the parameters are not a
mean and standard deviation but the number of trials and a probability
of success (check ?rbinom() for details). The probability is the third
argument to rbinom(), and the coefficients hidden inside it are what
MLE will try to estimate. Note the equation we use for it, which is the
reverse of our link function, just like we saw above.
y_binom <- rbinom(10000, 1, exp(intercept + slope*x)/(1+exp(intercept + slope*x)))
# our outcome variable.
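A quick sanity check (not in the original handout): with an intercept of 3 and x centred on zero, the average success probability should be roughly plogis(3), i.e. about 0.95:

mean(y_binom)  # should come out near 0.95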
Let’s run the glm:
binom_glm <- glm(y_binom ~ x, family = "binomial")
stargazer::stargazer(binom_glm, type = "html")
| Dependent variable: | y_binom |
|---|---|
| x | 0.213*** (0.047) |
| Constant | 3.025*** (0.048) |
| Observations | 10,000 |
| Log Likelihood | -1,894.758 |
| Akaike Inf. Crit. | 3,793.516 |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
What’s going on under the hood here? We can use the optim() function to
find out. optim() minimises (or, optionally, maximises) a function.
Here, we’ll write the negative log likelihood of our data as a function
of the two unknown coefficients, and ask optim() to minimise it.
Compare the results with the glm() function.
f <- function(b) {
  # inverse link: turn the linear predictor b[1] + b[2]*x into a probability
  p <- exp(b[1] + b[2]*x) / (1 + exp(b[1] + b[2]*x))
  # negative log likelihood of the observed 0/1 outcomes, given p
  -sum(dbinom(y_binom, 1, p, log = TRUE))
}
ans <- optim(fn = f, par = 0:1, hessian = TRUE)  # par gives the starting values c(0, 1)
ans$par
## [1] 3.0254967 0.2126678
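Pretty close to the glm() estimates above. One bonus (a sketch, not in the original handout): because we asked for the Hessian, we can also approximate the standard errors by inverting it; they should be close to the 0.048 and 0.047 stargazer reported for the glm:

sqrt(diag(solve(ans$hessian)))  # approximate standard errors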
Right, I think that’s enough calculus for today. Next week we’ll look at using logistic regression on some real data.